Statistical Models for Unsupervised Prepositional Phrase Attachment
Abstract
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore less resource-intensive and more portable than previous corpus-based algorithms proposed for this task. We present results for prepositional phrase attachment in both English and Spanish.

1 Introduction

Prepositional phrase attachment is the task of deciding, for a given preposition in a sentence, the attachment site that corresponds to the interpretation of the sentence. For example, the task in the following examples is to decide whether the preposition with modifies the preceding noun phrase (with head word shirt) or the preceding verb phrase (with head word bought or washed).

1. I bought the shirt with pockets.
2. I washed the shirt with soap.

In sentence 1, with modifies the noun shirt, since with pockets describes the shirt. In sentence 2, however, with modifies the verb washed, since with soap describes how the shirt is washed. While this form of attachment ambiguity is usually easy for people to resolve, a computer requires detailed knowledge about words (e.g., washed vs. bought) in order to resolve such ambiguities and predict the correct interpretation.

2 Previous Work

Most of the previous successful approaches to this problem have been statistical or corpus-based, and they consider only prepositions whose attachment is ambiguous between a preceding noun phrase and verb phrase. Previous work has framed the problem as a classification task, in which the goal is to predict N or V, corresponding to noun or verb attachment, given the head verb v, the head noun n, the preposition p, and optionally, the object of the preposition n2.
For example, the (v, n, p, n2) tuples corresponding to the example sentences are

1. bought shirt with pockets
2. washed shirt with soap

The correct classifications of tuples 1 and 2 are N and V, respectively.

Hindle and Rooth (1993) describe a partially supervised approach in which the FIDDITCH partial parser was used to extract (v, n, p) tuples from raw text, where p is a preposition whose attachment is ambiguous between the head verb v and the head noun n. The extracted tuples are then used to construct a classifier, which resolves unseen ambiguities at around 80% accuracy. Later work, such as (Ratnaparkhi et al., 1994; Brill and Resnik, 1994; Collins and Brooks, 1995; Merlo et al., 1997; Zavrel and Daelemans, 1997; Franz, 1997), trains and tests on quintuples of the form (v, n, p, n2, a) extracted from the Penn treebank (Marcus et al., 1994), and has gradually improved on this accuracy with other kinds of statistical learning methods, yielding up to 84.5% accuracy (Collins and Brooks, 1995). Recently, Stetina and Nagao (1997) have reported 88% accuracy by using a corpus-based model in conjunction with a semantic dictionary.

While previous corpus-based methods are highly accurate for this task, they are difficult to port to other languages because they require resources that are expensive to construct or simply nonexistent in other languages. We present an unsupervised algorithm for prepositional phrase attachment in English that requires only a part-of-speech tagger and a morphology database, and is therefore less resource-intensive and more portable than previous approaches, which have all required either treebanks or partial parsers.

3 Unsupervised Prepositional Phrase Attachment

The exact task of our algorithm will be to construct a classifier cl which maps an instance of an ambiguous prepositional phrase (v, n, p, n2) to either N or V, corresponding to noun attachment or verb attachment, respectively.
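
The input and output of such a classifier can be sketched in a few lines of Python. This is only an illustration of the task interface, not one of the models: the names PPInstance, accuracy, and always_noun are hypothetical, and always_noun is a trivial placeholder that stands in for a real classifier cl.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class PPInstance(NamedTuple):
    """One ambiguous instance (v, n, p, n2), given as head words."""
    v: str                    # head verb
    n: str                    # head noun
    p: str                    # preposition
    n2: Optional[str] = None  # object of the preposition


# A classifier cl maps an instance to "N" (noun attachment) or "V" (verb attachment).
Classifier = Callable[[PPInstance], str]


def accuracy(cl: Classifier, labelled: Iterable[Tuple[PPInstance, str]]) -> float:
    """Fraction of labelled instances on which cl predicts the gold label."""
    pairs = list(labelled)
    correct = sum(1 for inst, gold in pairs if cl(inst) == gold)
    return correct / len(pairs)


def always_noun(inst: PPInstance) -> str:
    """Placeholder classifier (an assumption for illustration, not a model from the text)."""
    return "N"


# The two example tuples from the text, with their correct classifications.
examples = [
    (PPInstance("bought", "shirt", "with", "pockets"), "N"),
    (PPInstance("washed", "shirt", "with", "soap"), "V"),
]

print(accuracy(always_noun, examples))  # 0.5
```

Any function with the Classifier signature can be passed to accuracy in the same way, which keeps the evaluation independent of how the classifier was trained.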
In the full natural language parsing task, there are more than just two potential attachment sites, but we limit our task to choosing between a verb v and a noun n so that we may compare with previous supervised attempts on this problem. While we will be given the candidate attachment sites during testing, the training procedure assumes no a priori information about potential attachment sites.

3.1 Generating Training Data From Raw Text

We generate training data from raw text by using a part-of-speech tagger, a simple chunker, an extraction heuristic, and a morphology database. The order in which these tools are applied to raw text is shown in Table 1. The tagger from (Ratnaparkhi, 1996) first annotates sentences of raw text with a sequence of part-of-speech tags. The chunker, implemented with two small regular expressions, then replaces simple noun phrases and quantifier phrases with their head words. The extraction heuristic then finds head word tuples and their likely attachments from the tagged and chunked text. The heuristic relies on the observed fact that in English and in languages with similar word order, the attachment site of a preposition is usually located only a few words to the left of the preposition. Finally, numbers are replaced by a single token, the text is converted to lower case, and the morphology database is used to find the base forms of the verbs and nouns.

The extracted head word tuples differ from the training data used in previous supervised attempts in an important way. In the supervised case, both of the potential sites, namely the verb v and the noun n, are known before the attachment is resolved. In the unsupervised case discussed here, the extraction heuristic only finds what it thinks are unambiguous cases of prepositional phrase attachment.
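
The final normalization step of the pipeline above (number replacement, lower-casing, base-form lookup) might look roughly like the following sketch. Several things here are assumptions made for illustration: the placeholder token "<num>", the regular expression used to recognize numbers, and the tiny BASE_FORMS dictionary, which merely stands in for a real morphology database. For simplicity the lookup is applied to every word of a tuple, whereas the text applies it to verbs and nouns.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical single token that replaces every number

# Toy stand-in for the morphology database (assumed entries, for illustration only).
BASE_FORMS = {
    "bought": "buy",
    "washed": "wash",
    "pockets": "pocket",
    "shirts": "shirt",
}

# Digits, optionally grouped with "." or ",", e.g. 42, 3.5, 1,500 (an assumed pattern).
_NUMBER = re.compile(r"^\d+([.,]\d+)*$")


def normalize(word: str) -> str:
    """Map numbers to one token, lower-case the word, then look up its base form."""
    if _NUMBER.match(word):
        return NUM_TOKEN
    word = word.lower()
    return BASE_FORMS.get(word, word)  # unknown words are kept unchanged


def normalize_tuple(heads: tuple) -> tuple:
    """Normalize every head word of an extracted tuple."""
    return tuple(normalize(w) for w in heads)


print(normalize_tuple(("Washed", "with", "soap")))  # ('wash', 'with', 'soap')
print(normalize_tuple(("shirts", "for", "1,500")))  # ('shirt', 'for', '<num>')
```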
Therefore, there is only one possible attachment site for the preposition, and either the verb v or the noun n does not exist, in the case of a noun-attached preposition or a verb-attached preposition, respectively. This extraction heuristic loosely resembles a step in the bootstrapping procedure used to get training data for the classifier of (Hindle and Rooth, 1993). In that step, unambiguous attachments from the FIDDITCH parser's output are initially used to resolve some of the ambiguous attachments, and the resolved cases are iteratively used to disambiguate the remaining unresolved cases. Our procedure differs critically from (Hindle and Rooth, 1993) in that we do not iterate, we extract unambiguous attachments from unparsed input sentences, and we totally ignore the ambiguous cases. It is the hypothesis of this approach that the information in just the unambiguous attachment events can resolve the ambiguous attachment events of the test data.

3.1.1 Heuristic Extraction of Unambiguous Cases

Given a tagged and chunked sentence, the extraction heuristic returns head word tuples of the form (v, p, n2) or (n, p, n2), where v is the verb, n is the noun, p is the preposition, and n2 is the object of the preposition. The main idea of the extraction heuristic is that an attachment site of a preposition is usually within a few words to the left of the preposition. We extract:
Similar resources
Statistical Models for Unsupervised Prepositional Phrase Attachment
We present several unsupervised statistical models for the prepositional phrase attachment task that approach the accuracy of the best supervised methods for this task. Our unsupervised approach uses a heuristic based on attachment proximity and trains from raw text that is annotated with only part-of-speech tags and morphological base forms, as opposed to attachment information. It is therefore...
Towards the Automatic Learning of Idiomatic Prepositional Phrases
The objective of this work is to automatically determine, in an unsupervised manner, Spanish prepositional phrases of the type preposition nominal phrase preposition (P−NP−P) that behave in a sentence as a lexical unit and their semantic and syntactic properties cannot be deduced from the corresponding properties of each simple form, e.g., por medio de (by means of), a fin de (in order to), con...
An Unsupervised Model for Statistically Determining Coordinate Phrase Attachment
This paper examines the use of an unsupervised statistical model for determining the attachment of ambiguous coordinate phrases (CP) of the form n1 p n2 cc n3. The model presented here is based on [AR98], an unsupervised model for determining prepositional phrase attachment. After training on unannotated 1988 Wall Street Journal text, the model performs at 72% accuracy on a development set from...
Web-Based Model for Disambiguation of Prepositional Phrase Usage
We explore some Web-based methods to differentiate strings of words corresponding to Spanish prepositional phrases that can perform either as a regular prepositional phrase or as idiomatic prepositional phrase. The type of these Spanish prepositional phrases is preposition–nominal phrase–preposition (P−NP−P), for example: por medio de ‘by means of’, a fin de ‘in order to’, con respecto a ‘with ...
Resolving prepositional phrase attachment ambiguities in Spanish with a classifier
In this paper we present a classifier that solves a certain kind of ambiguities in syntactic structure for Spanish, namely, ambiguities as to the point of adjunction of a prepositional phrase in the syntactic structure of a sentence (PP attachment). As a starting point, we used EsTxala dependency grammar for Spanish, integrated within FreeLing, with an accuracy score of 61% on PP adjunction. Ou...